Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
Template
spark = (
SparkSession.builder
.master("local")
.appName("Exploring Joins")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
Create a DataFrame
schema = T.StructType([
T.StructField("pet_id", T.IntegerType(), False),
T.StructField("name", T.StringType(), True),
T.StructField("age", T.IntegerType(), True),
])
data = [
(1, "Bear", 13),
(2, "Chewie", 12),
(2, "Roger", 1),
]
pet_df = spark.createDataFrame(
data=data,
schema=schema
)
pet_df.toPandas()
| | pet_id | name | age |
|---|---|---|---|
| 0 | 1 | Bear | 13 |
| 1 | 2 | Chewie | 12 |
| 2 | 2 | Roger | 1 |
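To double-check that the explicit `schema` was applied, one option is `printSchema()`, which lists each column's name, datatype and nullability (the expected output is shown roughly as comments):

```python
# Inspect the schema Spark attached to pet_df.
pet_df.printSchema()
# root
#  |-- pet_id: integer (nullable = false)
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
```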
Background
There are 3 datatypes in Spark: `RDD`, `DataFrame` and `Dataset`. As mentioned before, we will focus on the `DataFrame` datatype.
- This is the most performant and most commonly used datatype. `RDD`s are a thing of the past and you should refrain from using them unless you can't do the transformation with `DataFrame`s (see the sketch below). `Dataset`s are a thing in `Spark Scala`.
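As a minimal, purely illustrative sketch of dropping down to the `RDD` API and back: this particular transformation (upper-casing a column) could just as easily be done with `DataFrame` functions, and the column name `name_upper` is made up here.

```python
# Access the underlying RDD of Row objects.
upper_names_rdd = pet_df.rdd.map(lambda row: (row.pet_id, row.name.upper()))

# Convert back to a DataFrame as soon as possible.
upper_names_df = upper_names_rdd.toDF(["pet_id", "name_upper"])
upper_names_df.show()
```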
If you have used a `DataFrame` in Pandas, this is the same thing. If you haven't, a dataframe is similar to a `csv` or `excel` file. There are columns and rows that you can perform transformations on. You can search online for better descriptions of what a `DataFrame` is.
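For example, here is a small sketch of a row-level and a column-level transformation on `pet_df` (the age threshold and the derived column `age_in_dog_years` are made up for illustration):

```python
from pyspark.sql import functions as F

# Row-level transformation: keep only pets older than 5.
older_pets = pet_df.where(F.col("age") > 5)

# Column-level transformation: add a derived column.
older_pets = older_pets.withColumn("age_in_dog_years", F.col("age") * 7)

older_pets.show()
```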
What Happened?
For any `DataFrame` (`df`) that you work with in Spark, you should provide it with 2 things:

- a `schema` for the data. Providing a `schema` explicitly makes it clearer to the reader and can sometimes even be more performant, since Spark knows whether a column is `nullable`. This means providing 3 things:
    - the `name` of the column
    - the `datatype` of the column
    - the `nullability` of the column
- the data. Normally you would read data stored in `gcs`, `aws`, etc. and store it in a `df`, but there will be the odd occasion when you need to create one yourself, as we did above (a sketch of the read-from-storage path follows this list).
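Here is a hedged sketch of that more common path, reading a csv from storage with the explicit `schema` from above (the bucket path and file name are hypothetical):

```python
# Hypothetical location; replace with your own gcs/aws/local path.
pet_df_from_storage = spark.read.csv(
    "gs://my-bucket/pets.csv",  # hypothetical file
    schema=schema,              # reuse the explicit schema defined above
    header=True,
)
```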